Improving Bloom Filter Performance on Sequence Data Using k-mer Bloom Filters
Abstract
Using a sequence's k-mer content rather than the full sequence directly has enabled significant performance improvements in several sequencing applications, such as metagenomic species identification, estimation of transcript abundances, and alignment-free comparison of sequencing data. Because k-mer sets often reach hundreds of millions of elements, traditional data structures are often impractical for storing them, and Bloom filters (BFs) and their variants are used instead. BFs reduce the memory footprint required to store millions of k-mers while allowing fast set containment queries, at the cost of a small but nonzero false positive rate (FPR). We show that, because k-mers are derived from sequencing reads, the information about k-mer overlap in the original sequence can be used to reduce the FPR by up to 30× with little or no additional memory and with set containment queries that are only 1.3 to 1.6 times slower. Alternatively, we can leverage k-mer overlap information to store k-mer sets in about half the space while maintaining the original FPR. We consider several variants of such k-mer Bloom filters (kBFs), derive theoretical upper bounds for their FPR, and discuss their range of applications and limitations.
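The overlap idea in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the class `BloomFilter`, the helpers `kmers` and `neighbor_checked_query`, the SHA-256 double-hashing scheme, and the filter sizes are all assumptions made for illustration. The sketch also assumes that every stored k-mer has at least one overlapping neighbor in the set, which holds when the k-mers come from a sequence of length at least k + 1; a k-mer with no stored neighbor would be wrongly rejected.

```python
import hashlib


class BloomFilter:
    """Plain Bloom filter over strings (illustrative, not tuned for speed)."""

    def __init__(self, num_bits, num_hashes):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item):
        # Double hashing: derive num_hashes bit positions from one digest.
        digest = hashlib.sha256(item.encode("ascii")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        return [(h1 + i * h2) % self.num_bits for i in range(self.num_hashes)]

    def add(self, item):
        for p in self._positions(item):
            self.bits[p >> 3] |= 1 << (p & 7)

    def __contains__(self, item):
        return all(self.bits[p >> 3] & (1 << (p & 7))
                   for p in self._positions(item))


def kmers(seq, k):
    """Return all overlapping substrings of length k, in order."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def neighbor_checked_query(bf, kmer, alphabet="ACGT"):
    """Accept a k-mer only if it and at least one overlapping neighbor
    (shifted by one base to the left or right) are reported present."""
    if kmer not in bf:
        return False
    left = any(c + kmer[:-1] in bf for c in alphabet)
    right = any(kmer[1:] + c in bf for c in alphabet)
    return left or right


# Hypothetical example sequence and parameters.
sequence = "ACGTTGCATGCA"
k = 5
bf = BloomFilter(num_bits=1 << 16, num_hashes=3)
for km in kmers(sequence, k):
    bf.add(km)
```

Because the neighbor-checked query accepts a subset of what the plain query accepts, its false positive count can never exceed that of the plain filter; the price is up to eight extra lookups per query in this naive form.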
Similar resources
A Cuckoo Filter Modification Inspired by Bloom Filter
Probabilistic data structures are widely used in membership queries, network applications, and so on. Bloom Filter and Cuckoo Filter are two popular space-efficient models that are incorporated into the set membership checking part of many important protocols. They are compact representations of data that use hash functions to randomize a set of items. Being able to store more elements while keeping a reaso...
Optimizing Learned Bloom Filters by Sandwiching
We provide a simple method for improving the performance of the recently introduced learned Bloom filters, by showing that they perform better when the learned function is sandwiched between two Bloom filters.
Reducing False Positives of a Bloom Filter using Cross-Checking Bloom Filters
A Bloom filter is a compact data structure that supports membership queries on a set, allowing false positives. The simplicity and the excellent performance of a Bloom filter make it a standard data structure of great use in many network applications. In reducing the false positive rate of a Bloom filter, it is well known that the size of a Bloom filter and accordingly the number of hash indice...
Data Caching in Ad Hoc Networks using Bloom Filters
Data caching provides efficient data access by maintaining replicas of data in strategic parts of the network. However, current research in this area does not manage the memory space of each node efficiently. We propose an improvement by considering Bloom filters, a fast, space-efficient probabilistic method for looking up data. We compare the system performance with and without Bloom fil...
Cache Efficient Bloom Filters for Shared Memory Machines
Bloom filters are a well-known data structure that supports approximate set membership queries that report no false negatives. Each element in the universe represented by the Bloom filter is associated with k random bits in the structure. Traditional Bloom filters, therefore, require k non-local memory operations to insert an element or perform a lookup. For very large Bloom filters, these k l...